We consider task allocation for multi-object transport using a multi-robot system, in which each robot selects one object among multiple objects with different and unknown weights. Existing centralized methods assume a fixed number of robots and tasks, which makes them inapplicable to scenarios that differ from the learning environment. Meanwhile, existing distributed methods require only that the number of robots and tasks exceed a constant minimum, making them applicable to various numbers of robots and tasks. However, they cannot transport an object whose weight exceeds the load capacity of the robots observing it. To remain applicable to various numbers of robots and to objects with different and unknown weights, we propose a task-allocation framework based on multi-agent reinforcement learning. First, we introduce a structured policy model consisting of 1) predesigned dynamic task priorities with global communication and 2) a neural-network-based distributed policy model that determines the timing for coordination. Under local observations, the distributed policy builds consensus on the high-priority object and selects cooperative or independent actions. The policy is then optimized through trial and error by multi-agent reinforcement learning. This structured policy of local learning and global communication makes our framework applicable to various numbers of robots and objects with different and unknown weights, as demonstrated by numerical simulations.
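As a rough illustration of the structure this abstract describes (not the authors' code), the sketch below separates a predesigned global priority rule from a per-robot selection step; the priority rule, the stall-count heuristic, and all names are assumptions made for exposition, and the learned cooperate/wait decision is only indicated in a comment.

```python
import numpy as np

def dynamic_priorities(stall_counts, base_priority):
    """Predesigned dynamic priority (assumed form): objects that have stalled
    longer (robots tried and failed to move them) get higher priority, so a
    heavy object eventually attracts enough helpers."""
    return base_priority + stall_counts

def select_objects(priorities, observable):
    """Each robot commits to the highest-priority object it can observe; in
    the paper a learned distributed policy additionally decides when to wait
    for teammates (cooperate) versus transport alone."""
    choices = []
    for obs in observable:                      # one row per robot
        visible = np.where(obs)[0]
        if visible.size == 0:
            choices.append(None)                # robot sees no object
        else:
            choices.append(int(visible[np.argmax(priorities[visible])]))
    return choices

priorities = dynamic_priorities(np.array([2, 0]), np.array([1.0, 1.0]))
print(select_objects(priorities, np.array([[True, True], [False, True]])))
```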
In this paper, we present a solution to the problem of designing control strategies for multi-agent cooperative transport. Existing learning-based methods assume that the number of agents at deployment matches that in the training environment, whereas in reality the number may differ because robots' batteries can fully discharge or additional robots may be introduced to shorten task completion time. It is therefore crucial that the learned strategy remain applicable to scenarios wherein the number of agents differs from that in the training environment. We propose a novel multi-agent reinforcement learning framework of event-triggered communication and consensus-based control for distributed cooperative transport. The proposed policy model estimates the resultant force and torque in a consensus manner, using the estimates shared by neighboring agents. Moreover, it computes the control and communication inputs that determine when to communicate with neighboring agents, based on local observations and the estimates of the resultant force and torque. The proposed framework can therefore balance control performance against communication savings in scenarios wherein the number of agents differs from that in the training environment. We confirm the effectiveness of our approach using up to eight robots in simulations and up to six robots in experiments.
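A minimal sketch of the event-triggered consensus mechanism the abstract names, under the assumption that the learned trigger can be caricatured as a fixed drift threshold; the class, threshold, and gain below are illustrative stand-ins, not the paper's policy network.

```python
import numpy as np

class Agent:
    def __init__(self, estimate, threshold=0.1):
        self.estimate = np.asarray(estimate, dtype=float)
        self.last_sent = self.estimate.copy()
        self.threshold = threshold  # stands in for the learned trigger policy

    def maybe_broadcast(self):
        """Communicate only when the local estimate has drifted far enough
        from what neighbors last received (event-triggered communication)."""
        if np.linalg.norm(self.estimate - self.last_sent) > self.threshold:
            self.last_sent = self.estimate.copy()
            return self.last_sent
        return None

    def consensus_step(self, neighbor_msgs, gain=0.5):
        """Move the local estimate toward received neighbor estimates."""
        for msg in neighbor_msgs:
            self.estimate += gain * (msg - self.estimate)

a, b = Agent([0.0, 0.0]), Agent([1.0, 1.0])
a.estimate += 0.2                     # local update drifts past the threshold
msg = a.maybe_broadcast()
if msg is not None:
    b.consensus_step([msg])
print(b.estimate)                     # b has moved toward a's estimate
```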
Humans exhibit a variety of interesting behavioral characteristics when performing tasks, such as selecting among seemingly equivalent optimal actions, performing recovery actions when deviating from the optimal trajectory, or moderating actions in response to sensed risks. However, imitation learning, which attempts to teach robots to perform these same tasks from observations of human demonstrations, often fails to capture such behavior. Specifically, commonly used learning algorithms embody inherent contradictions between their learning assumptions (e.g., a single optimal action) and actual human behavior (e.g., multiple optimal actions), thereby limiting robot generalizability, applicability, and demonstration feasibility. To address this, this paper proposes designing imitation learning algorithms around human behavioral characteristics, thereby embodying principles for capturing and exploiting the demonstrator's actual behavior. We present the first imitation learning framework, Bayesian Disturbance Injection (BDI), that typifies human behavioral characteristics by incorporating model flexibility, robustification, and risk sensitivity. Bayesian inference is used to learn flexible non-parametric multi-action policies, while the policies are simultaneously robustified by injecting risk-sensitive disturbances that induce human recovery actions and ensure demonstration feasibility. Our method is evaluated through risk-sensitive simulations and real-robot experiments (a table-sweep task, a shaft-reach task, and a shaft-insertion task) using a UR5e 6-DOF robotic arm, demonstrating improved characterization of behavior. Results show significant improvement in task performance through improved flexibility, robustness, and demonstration feasibility.
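The following toy loop illustrates our reading of the disturbance-injection idea, not the authors' implementation: noise scaled by the fitted policy's residual uncertainty is injected while re-collecting demonstrations, so the dataset comes to contain recovery behavior. The 1-D "expert", the linear policy fit, and the initial noise scale are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
expert = lambda s: -0.8 * s                     # stand-in expert controller

def collect(noise_std, steps=30):
    """Roll out the expert while injecting disturbances; the induced
    deviations force the expert to demonstrate recovery actions."""
    s, data = 1.0, []
    for _ in range(steps):
        a = expert(s) + rng.normal(0.0, noise_std)  # injected disturbance
        data.append((s, a))
        s = s + a                                   # trivial dynamics
    return data

dataset, sigma = [], 0.1                 # initial injection scale (assumed)
for _ in range(3):                       # alternate fitting and injection
    dataset += collect(sigma)
    S = np.array([d[0] for d in dataset])
    A = np.array([d[1] for d in dataset])
    k = (S @ A) / (S @ S)                # fit a linear policy a = k * s
    sigma = (A - k * S).std()            # uncertainty -> next injection scale
print("learned gain:", round(float(k), 3))
```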
Generative adversarial imitation learning (GAIL) can learn policies without explicitly defining the reward function from demonstrations. GAIL has the potential to learn policies with high-dimensional observations such as images. By applying GAIL to a real robot, it may be possible to obtain robot policies for daily activities such as washing, folding clothes, cooking, and cleaning. However, human demonstration data are often imperfect due to mistakes, which degrades the performance of the resulting policy. We address this issue by focusing on the following features: 1) many robot tasks are goal-reaching tasks, and 2) it is relatively easy to label such goal states in demonstration data. With these in mind, this paper proposes Goal-Aware Generative Adversarial Imitation Learning (GA-GAIL), which trains a policy by introducing a second discriminator that distinguishes goal states, in parallel with the first discriminator that distinguishes demonstration data. This extends the standard GAIL framework so that a desirable policy can be learned even from imperfect demonstrations, through a goal-state discriminator that promotes reaching the goal state. Furthermore, GA-GAIL employs the Entropy-maximizing Deep P-Network (EDPN) as the generator, which accounts for both smoothness and causal entropy in policy updates, to achieve stable policy learning from the two discriminators. Our proposed method was successfully applied to two real cloth-manipulation tasks: turning a handkerchief over and folding clothes. We confirmed that it learned cloth-manipulation policies without any task-specific reward function design. Videos of the real experiments are available at https://youtu.be/h_nii2ooure.
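A sketch of how the two discriminators could combine into a generator reward (our paraphrase of the abstract, not the released code); the network sizes, the mixing weight `w`, and the log-sigmoid reward form are assumptions.

```python
import torch
import torch.nn as nn

def mlp(in_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.Tanh(), nn.Linear(64, 1))

d_demo = mlp(8)   # discriminator 1: demonstration vs. generated states
d_goal = mlp(8)   # discriminator 2: labeled goal vs. non-goal states

def reward(state, w=0.5):
    """Generator reward: resemble the demonstrations AND move toward the
    labeled goal states, so imperfect demos are corrected by the goal term."""
    r_demo = torch.log(torch.sigmoid(d_demo(state)) + 1e-8)
    r_goal = torch.log(torch.sigmoid(d_goal(state)) + 1e-8)
    return r_demo + w * r_goal

print(reward(torch.zeros(1, 8)))
```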
Deep reinforcement learning with domain randomization learns a control policy in various simulations with randomized physics and sensor model parameters for zero-shot transfer to the real world. However, when the range of the randomized parameters is wide, a large number of samples is often required to learn an effective policy, owing to the instability of policy updates. To alleviate this problem, we propose a sample-efficient method named Cyclic Policy Distillation (CPD). CPD divides the range of randomized parameters into several small subdomains and assigns a local policy to each subdomain. Learning of the local policies then proceeds while {\it cyclically} transitioning the target subdomain to neighboring subdomains, exploiting the learned values/policies of the neighboring subdomains with a monotonic policy-improvement scheme. Finally, all of the learned local policies are distilled into a global policy for sim-to-real transfer. The effectiveness and sample efficiency of CPD are demonstrated through simulations on four tasks (Pendulum from OpenAIGym, and Pusher, Swimmer, and HalfCheetah from Mujoco), as well as a real-robot ball-dispersal task.
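The schematic below captures the control flow we infer from the abstract: a cyclic sweep over subdomains, each local policy warm-started from its neighbor, followed by distillation. `learn_local` and `distill` are placeholder stand-ins (here simple scalar averaging), and the paper's monotonic-improvement gate is only noted in a comment.

```python
import numpy as np

def learn_local(subdomain, init):
    """Placeholder for RL inside one randomized subdomain; initializing from
    a neighbor's policy is what makes the cyclic sweep sample-efficient."""
    lo, hi = subdomain
    target = (lo + hi) / 2.0
    return target if init is None else 0.5 * (init + target)

def distill(local_policies):
    """Placeholder distillation of all local policies into a global one."""
    return float(np.mean(local_policies))

def cyclic_policy_distillation(param_range, n_sub=4, cycles=3):
    edges = np.linspace(param_range[0], param_range[1], n_sub + 1)
    subdomains = list(zip(edges[:-1], edges[1:]))
    local = [None] * n_sub
    for _ in range(cycles):
        for i in range(n_sub):                # cyclic sweep over subdomains
            neighbor = local[i - 1]           # wraps to the last subdomain
            local[i] = learn_local(subdomains[i], neighbor)
            # the paper gates this step with a monotonic-improvement scheme
    return distill(local)

print(cyclic_policy_distillation((0.5, 1.5)))  # global policy for transfer
```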
Recently, many works have explored sim-to-real transferable visual model predictive control (MPC). However, such works are limited to one-shot transfer, in which real-world data must be collected once to perform the sim-to-real transfer; this still demands significant human effort whenever a model learned in simulation is transferred to a new domain in the real world. To alleviate this problem, we first propose a novel model-learning framework called the Kalman Randomized-to-Canonical Model (KRC model). This framework can extract task-relevant intrinsic features and their dynamics from randomized images. We then propose Kalman Randomized-to-Canonical Model Predictive Control (KRC-MPC), which uses the KRC model as a zero-shot sim-to-real transferable visual MPC. The effectiveness of our method is evaluated through a manipulation task performed by a robot hand in both simulation and the real world, and a block-mating task in simulation. The experimental results show that KRC-MPC can be applied to various real domains and tasks in a zero-shot manner.
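As a minimal illustration of the "Kalman" part of the idea (our reading, not the paper's learned model), the snippet below runs one predict-correct step of a linear-Gaussian Kalman filter over a canonical latent feature; the matrices are assumed constants, and the learned encoder that maps randomized images to the observation `z` is omitted.

```python
import numpy as np

A, C = np.eye(2), np.eye(2)          # latent dynamics / observation (assumed)
Q, R = 0.01 * np.eye(2), 0.1 * np.eye(2)

def kalman_step(mu, P, z):
    """Predict-then-correct on the canonical feature z, which a learned
    encoder would extract from a domain-randomized image."""
    mu_p, P_p = A @ mu, A @ P @ A.T + Q
    K = P_p @ C.T @ np.linalg.inv(C @ P_p @ C.T + R)
    return mu_p + K @ (z - C @ mu_p), (np.eye(2) - K @ C) @ P_p

mu, P = np.zeros(2), np.eye(2)
mu, P = kalman_step(mu, P, z=np.array([0.3, -0.1]))
print(mu)                            # filtered canonical state for MPC
```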
As deep learning applications become more practical, practitioners inevitably face datasets corrupted by various kinds of noise, such as measurement errors, mislabeling, and estimated surrogate inputs/outputs, which can negatively affect optimization results. As a safety net, it is natural to improve the noise robustness of the optimization algorithm that updates the network parameters in the final stage of learning. Previous work revealed that the first momentum used in Adam-like stochastic gradient descent optimizers can be modified, based on the Student's t-distribution, to produce updates that are robust to noise. In this paper, we propose AdaTerm, which derives not only the first momentum but all of the involved statistics from the Student's t-distribution, providing for the first time a unified treatment of the optimization process under a t-distribution statistical model. When the computed gradients statistically appear to be aberrant, AdaTerm excludes them from the update and reinforces its robustness for subsequent updates; otherwise, it updates the network parameters normally and relaxes its robustness for the following updates. With this noise-adaptive behavior, AdaTerm's learning performance was confirmed on typical optimization problems across several cases in which the noise ratio is different and/or unknown. In addition, we prove a new general technique for deriving a theoretical regret bound without AMSGrad.
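Our simplified take on the mechanism the abstract describes (the paper's actual derivation differs in detail): under a Student's t model, a gradient that deviates strongly from the running statistics receives a small weight, so aberrant gradients barely move the momentum. All constants below are illustrative.

```python
import numpy as np

def t_robust_momentum(m, v, g, beta=0.9, nu=3.0, eps=1e-8):
    """Down-weight gradients with large deviation relative to the scale v;
    w -> (nu+1)/nu for typical gradients, w -> 0 for extreme outliers."""
    d2 = (g - m) ** 2 / (v + eps)          # squared deviation vs. scale
    w = (nu + 1.0) / (nu + d2)             # Student's-t reweighting
    k = (1.0 - beta) * w / (beta + (1.0 - beta) * w)
    m = m + k * (g - m)                    # robust first moment
    v = v + k * ((g - m) ** 2 + eps - v)   # scale adapts with the same weight
    return m, v

m, v = np.zeros(3), np.ones(3)
m, v = t_robust_momentum(m, v, g=np.array([0.1, 0.2, 50.0]))  # one outlier dim
print(m)   # the outlier dimension barely moves the momentum
```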
Scenarios requiring humans to choose among multiple seemingly optimal actions are commonplace; however, standard imitation learning often fails to capture this behavior. Instead, an over-reliance on replicating expert actions induces inflexible and unstable policies, leading to poor generalizability in application. To address this problem, this paper presents the first imitation learning framework that incorporates Bayesian variational inference for learning flexible non-parametric multi-action policies, while simultaneously robustifying the policies against sources of error by introducing and optimizing disturbances to create a richer demonstration dataset. This combinatorial approach forces the policy to adapt to challenging situations, enabling stable multi-action policies to be learned efficiently. The effectiveness of our proposed method is evaluated through simulations and real-robot experiments on a table-sweep task using a UR3 6-DOF robotic arm. Results show that, through improved flexibility and robustness, the learning performance and control safety are better than those of comparison methods.
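To see why a multi-action policy matters, the toy below contrasts a unimodal fit with a mixture fit on bimodal expert actions; scikit-learn's GaussianMixture is used purely as a stand-in for the paper's Bayesian non-parametric policy, and the scenario is invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Demonstrations: to pass an obstacle, experts go left (-1) or right (+1).
actions = np.concatenate([rng.normal(-1, 0.1, 100), rng.normal(1, 0.1, 100)])

unimodal = actions.mean()                       # averages into the obstacle
gmm = GaussianMixture(n_components=2).fit(actions.reshape(-1, 1))
print("unimodal action:", round(float(unimodal), 2))   # ~0.0: infeasible
print("mixture modes:", np.round(gmm.means_.ravel(), 2))  # ~[-1, 1]: both
```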
Large volumes of quantified online user activity data, such as weekly web search volumes, which co-evolve under the mutual influence of several queries and locations, serve as an important social sensor. Discovering the latent interactions in such data, i.e., the ecosystem among queries and the flow of influence among areas, enables accurate forecasting of future activities. However, this is a difficult problem in terms of both the data volume and the complex patterns covering the dynamics. To address this problem, we propose FluxCube, an effective mining method that forecasts large collections of co-evolving online user activities and provides good interpretability. Our model is an extension of the combination of two mathematical models: a reaction-diffusion system provides a framework for modeling the flow of influence between local area groups, and an ecosystem model captures the latent interactions among queries. Moreover, by leveraging the concept of physics-informed neural networks, FluxCube jointly achieves high interpretability, obtained from the parameters, and high forecasting performance. Extensive experiments on real datasets show that FluxCube outperforms comparable models in terms of forecasting accuracy and that each component in FluxCube contributes to the enhanced performance. We then present case studies showing that FluxCube can extract useful latent interactions between queries and area groups.
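A compact sketch of the two ingredients the abstract names, combined in one Euler step: diffusion of influence across areas plus Lotka-Volterra-style interaction between queries. This is our illustration of the modeling idea, not FluxCube itself; the shapes, coefficients, and the specific interaction form are assumptions.

```python
import numpy as np

def step(X, L, W, r, dt=0.1):
    """X[a, q]: activity of query q in area a.
    L: graph Laplacian over areas (diffusion = influence flow between areas).
    W[q, q']: ecosystem interaction between queries; r: growth rates."""
    diffusion = -L @ X                     # influence flow across areas
    ecosystem = X * (r + X @ W.T)          # competition/cooperation of queries
    return X + dt * (diffusion + ecosystem)

X = np.array([[1.0, 0.5], [0.2, 0.8]])     # 2 areas x 2 queries
L = np.array([[1.0, -1.0], [-1.0, 1.0]])
W = np.array([[-0.5, -0.1], [-0.2, -0.4]])
print(step(X, L, W, r=np.array([0.3, 0.2])))
```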
定义和分离癌症亚型对于促进个性化治疗方式和患者预后至关重要。由于我们深入了解,子类型的定义一直在经常重新校准。在此重新校准期间,研究人员通常依靠癌症数据的聚类来提供直观的视觉参考,以揭示亚型的内在特征。聚集的数据通常是OMICS数据,例如与基本生物学机制有很强相关性的转录组学。但是,尽管现有的研究显示出令人鼓舞的结果,但它们却遭受了与OMICS数据相关的问题:样本稀缺性和高维度。因此,现有方法通常会施加不切实际的假设来从数据中提取有用的特征,同时避免过度拟合虚假相关性。在本文中,我们建议利用最近的强生成模型量化量化自动编码器(VQ-VAE),以解决数据问题并提取信息的潜在特征,这些特征对于后续聚类的质量至关重要,仅保留与重建有关的信息相关的信息输入。 VQ-VAE不会施加严格的假设,因此其潜在特征是输入的更好表示,能够使用任何主流群集方法产生出色的聚类性能。在包括10种不同癌症的多个数据集上进行的广泛实验和医学分析表明,VQ-VAE聚类结果可以显着,稳健地改善对普遍的亚型系统的预后。
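The core mechanics mentioned in the abstract, shown in isolation rather than as the paper's pipeline: nearest-codebook vector quantization of latent vectors, followed by a mainstream clustering of the quantized features. Encoder training is omitted and the inputs are random placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))      # 16 codes, 8-dim latents (assumed)
latents = rng.normal(size=(100, 8))      # encoder outputs (placeholder data)

# Vector quantization: snap each latent to its nearest codebook entry.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
quantized = codebook[dists.argmin(axis=1)]

# Any mainstream clustering method can then operate on the quantized codes.
labels = KMeans(n_clusters=4, n_init=10).fit_predict(quantized)
print(np.bincount(labels))               # putative subtype sizes
```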